ABSTRACT

The P.E.P. Report 1969-1973 focuses on the various 
findings and activities of the Program Evaluation Project. 
Reliability is considered a basic aspect of any measurement system. 
With Goal Attainment Scaling, at least two types of reliability are 
important: the reliability of the followup guide construction and the 
reliability of the followup guide scoring. This chapter discusses the 
theory underlying applications of conventional reliability concepts 
to Goal Attainment Scaling and reviews a range of studies relevant to
the reliability of the methodology. This chapter is designed to give 
a general introduction to reliability and Goal Attainment Scaling. 
(Author/RC) 
GENERAL INTRODUCTION TO THE P.E.P. REPORT 1969-1973



The P.E.P. Report 1969-1973 focuses on the various findings and activities of the Program Evaluation 
Project. It is being published in pamphlet form with one pamphlet for each chapter. 

As of January, 1974, the Program Evaluation Project is funded by a three year collaborative grant 
with the Mental Health Services Division of the National Institute of Mental Health. The purpose of the 
grant is to emphasize the coordination and dissemination of information on a variety of program evaluation 
methodologies. Currently, it is expected that the title of the organization will be changed to the Program 
Evaluation Resource Center during 1974.

Further information on the Goal Attainment Scaling methodology and program evaluation is available in 
other written and recorded materials from the Program Evaluation Project office. Chapter One, "Basic Goal 
Attainment Scaling Procedures"; Chapter Five, "A Construct Validity Overview of Goal Attainment Scaling"; 
and Chapter Nine, "Evaluation of the Adult Outpatient Program, Hennepin County Mental Health Service" of 
the P.E.P. Report 1969-1973 are now available. Additional chapters will be released this year as they are 
completed. 
SYNOPSIS FOR CHAPTER THREE
AN INTRODUCTION TO RELIABILITY AND THE GOAL ATTAINMENT SCALING METHODOLOGY 



PURPOSE: Reliability is considered to be a basic aspect of any measurement system. With Goal Attainment
Scaling, at least two types of reliability are important: the reliability of the follow-up guide
construction and the reliability of the follow-up guide scoring. This chapter discusses the theory
underlying applications of conventional reliability concepts to Goal Attainment Scaling and reviews a range of
studies relevant to the reliability of the methodology. This chapter is designed to give a general
introduction to reliability and Goal Attainment Scaling. Another P.E.P. Report 1969-1973 chapter on
reliability discusses one particular study in depth and will be released later.



MAJOR FINDINGS: In the examination of Goal Attainment Scaling reliability, it should be remembered that
any outcome evaluation methodology is designed to vary from one point of measurement to another. Thus,
because Goal Attainment Scaling is such an evaluation-oriented measurement, it should not be expected to
produce identical results if the same client is scored at different times. As a result of this
characteristic and of difficulties in applying other conventional reliability approaches to Goal Attainment
Scaling, most of the available reliability studies concentrate on inter-rater agreement, with a few others
being concerned with alternate form reliability.

In the original reliability study, which is discussed in greater detail in another P.E.P. Report
1969-1973 chapter, for each of 44 clients at the outpatient unit, one follow-up guide was constructed by
the intake interviewer and a second follow-up guide was made somewhat later by the therapist. These two
follow-up guides were combined and then scored twice at two separate interviews by two different raters.
For the follow-up guide prepared by the intake interviewer, the Goal Attainment scores from the two
interviews were correlated .711 and for the follow-up guide prepared by the therapist, scores from the two
interviews correlated .625. The other P.E.P. Report 1969-1973 chapter on reliability discusses this
study more intensively (Sherman, Baxter, Audette, 1974).

A variety of other reliability studies are discussed in less detail. In the interdisciplinary
reliability study, 60 clients were interviewed twice on the basis of follow-up guides constructed by
intake interviewers, with the interviews being conducted either by nurses or social workers and either by
telephone or in person. For this study, the Goal Attainment scores from the first and second interviews
were correlated .65, and there were no significant differences in mean scores between the two types of
interviewers or between the telephone and in-person interviews.

Other studies cover data from multiple raters scoring a videotape interview, multiple rater scorings
of organizational goals, comparison of client and follow-up interview ratings (correlation of .71), and 
multiple ratings over time in the drug effectiveness study. 
I. Theory and Background on Goal Attainment
Scaling Reliability

Reliability is a basic property of an instrument
or methodology designed for practical
use. (Goal Attainment Scaling is generally
referred to in this discussion as a "methodology"
or "technique" rather than a "test", since tests
are often interpreted as tasks done by an
individual rather than tasks which are done for or
to the individual [Kelly, 1967].) Goal Attainment
Scaling is a relatively unique methodology,
and because of its nonstandardized,
outcome-oriented nature, most of the classical concepts
of reliability have to be interpreted loosely
when applied to it. However, since reliability
is of such central importance to many persons
utilizing the methodology, the author in this
chapter shows one way in which Goal Attainment
Scaling reliability could be approached.

This section of the chapter discusses both 
the theory and data on reliability concepts as 
applied to Goal Attainment Scaling. The theo- 
retical discussion which begins this section of 
the chapter is followed by a summary of several
reliability investigations undertaken by the 
Program Evaluation Project staff. (As noted in 
the synopsis, a second chapter on reliability 
written by Sherman, Baxter, Audette, presents a 
more intensive analysis of the original Program 
Evaluation Project reliability study.) 

In this chapter, the psychometric complexity
of studying reliability in a new instrument is
underscored. According to Cronbach, one of the
major psychometric theorists, "any research based
on measurement must be concerned with the
accuracy or dependability or, as we usually call it,
reliability of measurement. A reliability
coefficient demonstrates whether the test designer was
correct in expecting a certain collection of
items to yield interpretable statements about
individual differences." (1951) Although such a
definition can at least conceptually be applied
to the reliability of Goal Attainment Scaling,
there are at least two immediate problems in such
an application. These two problems are summarized
in the pair of Cronbach's phrases:



• "...a certain collection of items..."
and

• "...statements about individual
differences."



When Cronbach refers to "...a certain
collection of items...", he is proceeding from
the typical background of testing theory, which
is based largely on fixed-item aptitude or
information tests and inventories of opinion.
However, Goal Attainment Scaling is not a "test"
in many of its applications, as noted previously,
but an indicator of outcomes. Thus, the Goal
Attainment score is not basically a "...statement
about individual differences". In fact, Goal
Attainment Scaling is designed to minimize the
impact of individual differences on the Goal
Attainment score by orienting measurement to the
"expected" or "best prediction" outcome. Goal
Attainment Scaling is usually based on a
flexible (as opposed to "certain") set of items
since the scales are individually developed for
each client.



A. Tests of Reliability Appropriate to
Goal Attainment Scaling



An indicator of outcomes of treatment or 
other processes must be able to vary from one
point of measurement to another point (usually 
these are points in "time" since most treatment 
proceeds through a time period) if there is to 
be any possibility of evaluating the treatment. 
One of the assumptions of current theories of 
treatment outcome is that most people do change 
(spontaneously or otherwise) during mental health 
treatment and that these changes occur at differ- 
ing rates over time. 

A similar situation exists with respect to
the Minnesota Multiphasic Personality Inventory
and it is reported that: "The limitations in the
methods of reliability estimation based upon
retest data can be readily discerned as they apply
to instruments like the Minnesota Multiphasic
Personality Inventory. The usual index of score
stability is based upon the degree of
correspondence between the ranking of subjects in a group
on two different occasions, summarized in a
correlation coefficient. To interpret such a
correlation as a gauge of the dependability of the
scores on some scales, it is necessary to assume
that the rankings of the group from the first to
the second testing should not change except
through errors in the measurement of their
positions. Since there is scarcely any scale on
the Minnesota Multiphasic Personality Inventory
for which this general assumption is tenable for
any period longer than a day or two, the various
estimates of scale stability published on the
basic Minnesota Multiphasic Personality
Inventory scales cannot be readily interpreted as
indices of the inherent dependability of the
scores from these scales obtained on any one
occasion." (Dahlstrom, 1969) A similar
observation was made much earlier by Hathaway (1956)
who argues that:

"It is always difficult to evaluate any of the
usual reliability data on personality measures
that are likely to show valid time-related
variance in the individual subjects. All the
Minnesota Multiphasic Personality Inventory scales
are sensitive to therapeutic and other effects.
Since both the motivation and the life situation
of the subjects are likely to change almost
momentarily, it is always possible that an
observed change in score is valid variance instead
of error variance." (1956) Like the MMPI
theorists, the Rorschach expert Holzberg (1960)
contends that "it cannot be assumed that the
object of study, the personality of a subject, is
unchanging. Significant aspects of the
personality change through time in response to internal
and external factors," (p. 368) and concludes
"...that the traditional methods of assessing
psychometric reliability are inappropriate to the
Rorschach ..." (p. 377).

Anastasi (1968) states that "...in its
broadest sense, test reliability indicates the
extent to which individual differences in test
scores are attributable to true differences in
the characteristics under consideration and the
extent to which they are attributable to chance
errors... Factors that might be considered error
variance for one purpose would be classified
under true variance for another. For instance,
if we are interested in measuring fluctuations
of mood, then the day-by-day changes in score
on a test of cheerfulness-depression would be
relevant to the purpose of the test and would
hence be part of the true variance of the scores."
Thus, experience with other test systems, all of
which share some but not all of the sources of
reliability associated with the Goal Attainment
Scaling methodology, suggests that variation over
time does not always indicate a deficiency in
the measurement instrument. In the case of
outcome-oriented measurements, change over time
could be meaningful variation, rather than
"unreliability".

The above citations refer basically to "re- 
test" measures of reliability. There is much 
disagreement among psychometricians as to the
proper classification of the different approaches 
to reliability estimation. Anastasi, however, 
advances a useful list of five types of reli- 
ability estimates. 

a. Test-retest (giving the same test
twice). Uses two or more testing
sessions.

b. Alternate-form (also called parallel
forms by other experts). Uses only
one testing session.

c. Split-half (dividing the test into
two halves). Uses only one testing
session.

d. Kuder-Richardson (the internal
consistency or inter-item homogeneity
approach). Uses only one testing
session.

e. Scorer accuracy (accuracy of scoring
the test). Uses only one testing
session.

As already emphasized, almost all
applications of reliability theory to Goal Attainment
Scaling involve some extension or special
modifications of theory. The application to the
Goal Attainment score of each of these five types
of reliability estimates is discussed below.



TYPE 1. TEST-RETEST RELIABILITY



Because the Goal Attainment score would be 
expected to vary meaningfully over time, reliability 
measurements should be based on observations which 
are essentially carried out at the same time. When 
test-retest scores are used for reliability esti- 
mates, any significant gap in time between test and 
retest will deflate the reliability estimate. Thus, 
the test-retest approach to reliability estimation 
is probably not appropriate to Goal Attainment 
Scaling. 
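
The deflation described above can be illustrated with a small simulation. The sketch below is only illustrative: the number of clients, the spread of outcome levels, the size of the scoring error and the size of the real change between sessions are all hypothetical values chosen for the example, not figures from the Program Evaluation Project data.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_retest(n_clients=2000, change_sd=0.0, seed=1):
    """Return the test-retest correlation for simulated clients.

    All parameters are hypothetical: outcome levels have SD 10, scoring
    error has SD 5, and each client's real change between the two
    sessions has SD `change_sd`.
    """
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_clients):
        level = rng.gauss(50, 10)          # client's level at session 1
        change = rng.gauss(0, change_sd)   # real change before session 2
        first.append(level + rng.gauss(0, 5))
        second.append(level + change + rng.gauss(0, 5))
    return pearson_r(first, second)


r_same_time = simulate_retest(change_sd=0.0)
r_with_gap = simulate_retest(change_sd=8.0)
print(f"no real change between sessions: r = {r_same_time:.2f}")
print(f"real change between sessions:    r = {r_with_gap:.2f}")
```

The scoring error is identical in both runs, so the lower second correlation reflects only the real change in the simulated clients and not any loss of scoring accuracy.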

TYPE 2. ALTERNATE FORM RELIABILITY 

When one follow-up guide is constructed by
the client's intake interviewer and a second
follow-up guide is constructed by the client's
therapist, as in the original reliability study, it
could be said that two "alternate forms" have
been developed. Many other pairs of alternate
forms could be constructed, of course, such as
client-constructed follow-up guides versus
therapist-constructed follow-up guides, and so on.
Most investigations of Goal Attainment Follow- 
up Guides' content reliability depend, in effect, 
on the construction of "alternate forms" of a 
Goal Attainment Follow-up Guide for a client. 

However, the construction of two or more 
follow-up guides for the same client might more 
simply be conceptualized as a form of "inter- 
rater agreement" reliability. If two or more 
raters have a high degree of agreement in the 
contents and outcomes as measured by the follow- 
up guides they construct, the reliability estimate 
is high. Of course, the usual concept of inter- 
rater agreement does not involve the actual de- 
velopment of the items by the raters, but in the 
case of Goal Attainment Scaling, the construction 
of the scales, especially the expected levels, 
could be thought of as a form of predictive rating 
of the client's expected outcome.

TYPE 3. SPLIT-HALF RELIABILITY 

This reliability estimate is not appropriate
to Goal Attainment Scaling. It involves splitting
the test into comparable halves by some method
and is a form of internal consistency reliability.

This reliability estimate is based on the
concept that variance should be similar for both
halves of the instrument and that the scales
will have high inter-correlations. Such
assumptions do not hold for Goal Attainment Scaling.

TYPE 4. INTERNAL CONSISTENCY RELIABILITY 

The same reservations cited for the Split-Half
reliability estimate apply even more strongly
for the Kuder-Richardson method, which is an
extension of the internal consistency concept. This
approach is probably not useful where the scores 
or contents are expected to vary over time. 

Nunnally (1967), after presenting the
possibility of meaningful (i.e., valid) changes in
subjects over time, argues that "Systematic
differences in contents of tests or variations in
people from one occasion to another cannot be
adequately handled by a model which is based on
the random sampling of items. For adequately
handling these factors, the model must be
extended to consider the random sampling of whole
tests, in which case the tests are thought of
as being randomly sampled for particular occasions
and correlations among tests are permitted to be
somewhat lower than would be predicted from the
correlations among items within tests. In that
case, the average correlation among a number of
alternative forms administered on different
occasions, or the correlation between only two such
forms, would be a better estimate than that
provided by coefficient alpha or KR-20." Nunnally's
arguments imply that for tests measuring variables
expected to vary over time, such as Goal
Attainment Scaling, internal consistency estimates of
reliability are not necessarily appropriate.

The related alpha coefficient of Cronbach 
also demands inter-item homogeneity. As noted be- 
fore, homogeneity is not necessarily assumed for 
the Goal Attainment Follow-up Guide of a client. 



TYPE 5. SCORER ACCURACY RELIABILITY 

For observations where there is some degree 
of subjectivity in the scoring or rating, two or 
more scorers can rate each result. The result- 
ing scores are correlated to produce an estimate 
of the agreement or accuracy of the scoring. In 
the case of Goal Attainment Scaling, it could be 
said that there is a second type of "scorer ac- 
curacy" reliability estimate based on the simi- 
larity of contents on follow-up guides for two 
or more raters (as mentioned above, this use of 
the inter-rater agreement idea in this context 
is unusual but possible).

A number of studies of Goal Attainment Scaling
reliability have been undertaken, but almost all
involve the "inter-rater agreement" approach to
reliability, measuring inter-rater agreement on either
the construction of the Goal Attainment Follow-up
Guide or the scoring of the Goal Attainment Follow-up
Guide. Figure I lists the Program Evaluation
Project reliability studies which are discussed in
this chapter.



FIGURE I: Program Evaluation Project Studies of
Goal Attainment Scaling Reliability

STUDY                   INTER-RATER AGREEMENT IN            INTER-RATER AGREEMENT IN
                        CONSTRUCTION OF GOAL ATTAINMENT     SCORING GOAL ATTAINMENT
                        FOLLOW-UP GUIDES (SCORER            FOLLOW-UP GUIDES
                        ACCURACY OR ALTERNATE FORM)         (SCORER ACCURACY)

ORIGINAL RELIABILITY    Intake Interviewer vs Therapist     Rater 1 vs Rater 2
STUDY                   Goal Attainment Follow-up Guides    (Two sessions)

INTERDISCIPLINARY       -                                   In-person vs Telephone
RELIABILITY STUDY                                           Rater 1 vs Rater 2
                                                            (Two sessions)

RE-DESIGN VALIDITY/     -                                   Therapist vs Independent
RELIABILITY STUDY                                           Interviewer
                                                            (Two sessions)

GUIDE TO GOALS STUDY    Client vs Intake Worker             Client vs Independent
PHASE 2                 Goal Attainment Follow-up           Interviewer
                        Guide Study                         (One session)

VIDEOTAPE               Multiple Follow-up Guide            Multiple Scorers
                        Constructors                        (One session)

DRUG STUDY              -                                   Three Raters
                                                            Time 1 vs Time 2 vs Time 3
                                                            (One session per time)
B. Reporting Reliability Results

1. Estimates of Reliability 

All single experimental measures of reliability
are estimates, derived from one particular sample
of clients, clinicians, raters, and so on, of the
theoretical "true" degree of reliability which is
the mean of a theoretical distribution of
reliability scores. In practice, except for this
hypothetical true mean reliability score, there is no single
or absolute reliability correlation for an
instrument. The reliability may vary depending on the
situation in which the instrument is applied (e.g.,
if the raters are inexperienced, a lower inter-rater
reliability score might be expected). Anastasi
notes, for example, that even for the venerable
Stanford-Binet, the reliability coefficient varies
from .83 to .98 for various ages and I.Q. levels
(Anastasi, 1968).

2. The Coefficient of Correlation and 
Reliability Estimates 

The common use of the coefficient of correlation
is a matter of tradition or convenience. Reliabil- 
ity estimates are most completely expressed by 
descriptions of the components of variance which 
are due to various true score and error compon- 
ents. However, for many persons in the human 
service field, correlation coefficients are more 
familiar than analysis of variance of error com- 
ponents. Unless noted otherwise, all coefficients 
are based on the Pearson Product Moment correlation. 

3. Percentage of Agreement and Reliability 

At times, various percentage of agreement 
measures are used. Such measures are intended to 
suggest the degree of inter-rater agreement of
Goal Attainment scores, but are not directly com- 
parable to reliability coefficients. 

4. Two Methods of Expressing Goal Attainment 
Scores 

The correlation coefficients may be based on
either of two different expressions of the Goal
Attainment scores. The first expression is the
traditional Kiresuk-Sherman Goal Attainment score,
which gives a single, summary score for the entire
Goal Attainment Follow-up Guide, that is, one
score per client in most cases.



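Goal Attainment score:

$$T = 50 + \frac{10 \sum_{i=1}^{n} w_i x_i}{\sqrt{(1-\rho)\sum_{i=1}^{n} w_i^2 + \rho\left(\sum_{i=1}^{n} w_i\right)^2}}$$

where $x_i$ is the attainment level scored on scale $i$ (from -2 to +2), $w_i$ is the relative weight
assigned to that scale, $n$ is the number of scales on the follow-up guide, and $\rho$ is the assumed
average intercorrelation among the scale scores.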

The second expression is the scale-by-scale score,
which is a simple mean of the scale scores on a
single follow-up guide. Each Goal Attainment
Follow-up Guide, of course, is made up of several
individually developed "scales", which could also
be called items. The scale-by-scale scores may
be presented either in a -2 to +2 range or a 1 to
9 range. (The 1 to 9 range is more convenient for
computer entry, so that +2 is equivalent to 9, +1.5
is equivalent to 8, +1.0 is equivalent to 7 and
so on.) Scale-by-scale analyses assume, in effect,
that the Goal Attainment Follow-up Guide is
somewhat akin to an inventory or test composed of a
number of semi-Likert-type scales. For various
purposes, investigators may be interested in either
the total Goal Attainment score or in the scores
on individual scales on a follow-up guide.
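
The two expressions can be connected with a short computational sketch. It uses the summary formula as published by Kiresuk and Sherman, T = 50 + 10 Σwx / sqrt((1 − ρ) Σw² + ρ (Σw)²), together with the 1-to-9 conversion described above. The function names, the default of equal weights and the value ρ = 0.3 are assumptions made for this illustration and are not specified in this chapter.

```python
import math


def to_attainment_level(score_1_to_9):
    """Convert a 1-to-9 scale score to the -2-to-+2 range (9 -> +2, 8 -> +1.5)."""
    return (score_1_to_9 - 5) / 2


def goal_attainment_score(levels, weights=None, rho=0.3):
    """Summary Goal Attainment score for one follow-up guide.

    levels  : attainment levels in the -2-to-+2 range, one per scale
    weights : relative scale weights (equal weights if omitted)
    rho     : assumed average intercorrelation among scales; the value
              0.3 is an assumption of this sketch, not taken from the text
    """
    if weights is None:
        weights = [1.0] * len(levels)
    weighted_sum = sum(w * x for w, x in zip(weights, levels))
    sum_of_squares = sum(w * w for w in weights)
    square_of_sum = sum(weights) ** 2
    denominator = math.sqrt((1 - rho) * sum_of_squares + rho * square_of_sum)
    return 50 + 10 * weighted_sum / denominator


# Three scales scored 4, 5 and 6 on the 1-to-9 range.
levels = [to_attainment_level(s) for s in (4, 5, 6)]
print(levels)
print(goal_attainment_score(levels))
```

With scores of 4, 5 and 6 the attainment levels are -0.5, 0 and +0.5, which cancel, so the summary score is 50. For a guide with a single scale the denominator equals the weight and the formula reduces to 50 + 10x, so a scale score of 6.5 on the 1-to-9 range corresponds to a summary score of 57.5.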



II. Summaries of Reliability Studies

In this section, the reliability studies will 
be presented roughly in chronological order, ex- 
cept for the internal consistency measures, which 
will be presented together in a single subsection. 
The following comments summarize the correlational
results of the original reliability study. 

A. The Original Reliability Study Results 

The original reliability study was based 
on two independent scorings of eighty-eight 
follow-up guides for forty-four cases. These 
forty-four cases were those in a larger group
which met a range of criteria as described in the 
Sherman, Baxter, and Audette reliability chapter 
(1974). The study produced the following inter- 
rater agreement reliability estimates of scorer 
reliability and follow-up guide constructor re- 
liability. Each client was represented by two 
follow-up guides, one constructed by the client's 
intake interviewer and the second constructed by 
the client's therapist approximately three weeks 
later. Scales from the two follow-up guides 
were intermixed into a single composite guide. 
This composite follow-up guide was scored inde- 
pendently by two different scorers at sessions 
which averaged 25 days apart. The time elapsed 
between the two follow-up interviews was unin- 
tentional, as it was due to delays in scheduling 
the second interview. Table I presents the
correlations.

TABLE I: Correlation Coefficients from the
Original Reliability Study (N = 44)

COMPARISON                                      COEFFICIENT OF
                                                CORRELATION

First Interview versus Second Interview             .704
(Mean of scores from both Goal Attainment
Follow-up Guides from the first interview
compared to means of both scores from the
second interview.)

Intake Interviewer Follow-up Guide,                 .711
First Interview versus Second Interview

Therapist Follow-up Guide,                          .625
First Interview versus Second Interview



When the results of this study are analyzed 
in terms of variance components, it is estimated 
that 18% of the variance is due to follow-up in- 
terviewer errors in scoring, 17% of the variance 
is due to the choice of material on the follow-up
guide itself, 15% of the variance is due to
the effects of time or circumstance difference 
between the two follow-up interviews, and the re- 
maining 50% of the variance is due to client out- 
come. 
B. The Interdisciplinary Reliability Study

This study of scorer accuracy is described
more extensively elsewhere (see chapter on
follow-up in P.E.P. Report 1969-1973). The
study's basic goals were to compare scoring of
the Goal Attainment Follow-up Guide by two
methods (telephone versus in-person interviews)
and by two different types of interviewers
(M.S.W. versus R.N.). In ten months, 60
clients at the Hennepin County Mental Health
Service were interviewed twice. The assignment
of follow-up method and type of interviewer
was random. There was a mean interval
of 27 days between a client's first and
second interview. Table II presents the
correlations obtained for the various interviewer
combinations. None of the differences in mean
scores reached the p < .05 level of significance.
(Data from this study are based on Audette's
chapter on follow-up procedures in the P.E.P.
Report 1969-1973.)



TABLE II: Inter-Scorer Correlation Coefficients
for the Interdisciplinary Study

COMPARISON                 NUMBER    CORRELATION               INTERVIEW   GOAL         GOAL ATTAINMENT
                           OF                                              ATTAINMENT   SCORE STANDARD
                           CASES                                           SCORE        DEVIATION

Total, First vs            60        [illegible]               First       ?0.7         11.7
Second Interview                     (p < .0?1, 2-tailed)      Second      5?.4         12.7

Both Interviews by         13        [illegible]               First       51.8         16.0
M.S.W.'s, First vs                   (p < .0?, 2-tailed)       Second      50.1         16.5
Second Interview

Both Interviews by         10        .570                      First       50.3         11.7
R.N.'s, First vs                     (not significant)         Second      57.9         ?.9
Second Interview

First Interview by         18        .759                      First       50.0         10.0
R.N., Second Interview               (sign. at p < .001,       Second      ??.2         10.0
by M.S.W.                            2-tailed)

First Interview by         19        [illegible]               First       48.8         10.0
M.S.W., Second                       (sign. at p < .01,        Second      51.2         12.4
Interview by R.N.                    2-tailed)

In the original reliability study, all inter- 
views were by experienced social workers. In the 
"interdisciplinary" study, even with variation of 
follow-up method and type of interviewers, the re- 
liability coefficients are similar. 
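
The significance notes attached to the legible coefficients in Table II can be checked against the number of cases. The chapter does not state which significance test was used, so the sketch below assumes the usual t test for a Pearson correlation, t = r sqrt(n − 2) / sqrt(1 − r²), with n − 2 degrees of freedom. The critical values are copied from a standard t table, and the count of 18 cases for the R.N.-then-M.S.W. comparison is the figure implied by the 60-case total rather than a clearly legible entry.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing a Pearson r against zero (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Two-tailed critical values copied from a standard t table.
CRITICAL_T = {
    (8, 0.05): 2.306,    # df = 8,  p < .05
    (16, 0.001): 4.015,  # df = 16, p < .001
}

# (label, r, number of cases, significance level to check)
rows = [
    ("Both interviews by R.N.'s", 0.570, 10, 0.05),
    ("First by R.N., second by M.S.W.", 0.759, 18, 0.001),
]

for label, r, n, alpha in rows:
    t = t_for_correlation(r, n)
    significant = abs(t) > CRITICAL_T[(n - 2, alpha)]
    print(f"{label}: t = {t:.2f}, df = {n - 2}, "
          f"significant at {alpha}: {significant}")
```

Under these assumptions the coefficient of .570 on 10 cases falls short of the .05 level while .759 on 18 cases exceeds the .001 level, which agrees with the notes in the table and shows how strongly the significance of a coefficient depends on the number of cases.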



C. Internal Consistency Indicators 

Internal consistency is not essential to an 
understanding of reliability of Goal Attainment 
Scaling, but the internal consistency measures 
illustrate interesting psychometric features
of the Goal Attainment Scaling methodology. Two 
analyses illustrating features of the Goal At- 
tainment Scaling internal consistency are pre- 
sented in this sub-section. 

1. Correlations Between the Individual 
Scale Scores and Total Follow-up Guide 
Scores 

The item/total correlation which is often
used to suggest the degree of internal
consistency can be adapted to Goal Attainment Scaling.
Since there are so few items on any one Goal
Attainment Follow-up Guide, one hundred Goal
Attainment Follow-up Guides were selected
randomly from the follow-up guides which had been
scored for the Program Evaluation Project "four
mode study". For these 100 follow-up guides, a
correlation was calculated between the
scale-by-scale score (with a range of 1 to 9) for each of
the 344 scales and the overall Kiresuk-Sherman
Goal Attainment score for the Goal Attainment
Follow-up Guide on which that scale was contained.
For example, if a follow-up guide contained three
scales, one scored four, one scored five and one
scored six, the following pairs would be formed:
(4, 50), (5, 50) and (6, 50).

Such a pairing was carried out for all 344
scales in the sample, and these 344 pairs were
correlated to produce the single coefficient of
.693. The mean scale score was 5.34 and the mean
overall Kiresuk-Sherman Goal Attainment Follow-up
Guide score was 52.19 (Data from calculations by
C. Jasperson and J. Baxter). This coefficient
suggests a moderately high correlation on the average
between any single scale score and the
corresponding score on the Goal Attainment Follow-up Guide
which contains it. Thus, knowing the result of
scoring one scale on the Goal Attainment Follow-up
Guide allowed a fairly good prediction of the
total Kiresuk-Sherman score.
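
The pairing procedure can be sketched in a few lines. The follow-up guides below are hypothetical apart from the three-scale example taken from the text, and the helper names are invented for this illustration. The sketch only shows how each scale score is matched with the total score of the guide that contains it before a single Pearson correlation is computed over all pairs.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def scale_total_pairs(guides):
    """Pair every scale score with the total score of the guide containing it.

    guides: list of (scale_scores, total_score) tuples, one per follow-up guide.
    """
    return [(s, total) for scale_scores, total in guides for s in scale_scores]


# Hypothetical follow-up guides: 1-to-9 scale scores and a summary score.
guides = [
    ([4, 5, 6], 50.0),     # the three-scale example from the text
    ([7, 6], 59.0),
    ([3, 4, 4, 5], 41.0),
    ([6, 7, 5], 56.0),
]

pairs = scale_total_pairs(guides)
scale_scores = [p[0] for p in pairs]
totals = [p[1] for p in pairs]
print(pairs[:3])
print(round(pearson_r(scale_scores, totals), 3))
```

Because every scale on a guide is paired with the same total, guides with more scales contribute more pairs, which is one reason for repeating the analysis separately by number of scales.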

2. Correlations between the Individual 
Scale Scores and Total Follow-up Guide 
Scores, by Number of Scales 

A second data analysis was performed on all
634 Goal Attainment Follow-up Guides scored for
the main Program Evaluation Project study. These
follow-up guides include a total of 2173 scales for
a mean of 3.43 scales per Goal Attainment Follow-up
Guide. In this analysis, the correlation described
above (between the score on the individual scale
and the overall Kiresuk-Sherman Goal Attainment
score for the follow-up guide of which the scale was
part) was calculated separately for Goal Attainment
Follow-up Guides with differing numbers of scales,
so as to remove any confusing effect of varying
numbers of scales per Goal Attainment Follow-up Guide.
This precaution was not applied to the preliminary
scale/total analysis described above.

All the correlation coefficients as shown on
Table III are statistically significant at better
than the p < .01 level (two-tailed).

TABLE III: Correlations Between Scale Scores and
Overall Goal Attainment Scores for
634 Cases, Divided by the Number of
Scales per Follow-up Guide.

Number of      Number of     Number     Mean      Mean Goal     r
Scales per     Follow-up     of         Scale     Attainment
Follow-up      Guides*       Scales     Score     Score
Guide

1              12            12         6.50      57.5          .98
2              85            170        5.34      52.0          .78
3              262           786        5.35      52.4          .68
4              180           720        5.10      50.9          .66
5              88            440        5.06      50.7          .64

*The correlations for the 4 Goal Attainment
Follow-up Guides with 6 scales and the 3
Goal Attainment Follow-up Guides with 7
scales are not included because of the small
size of the samples.
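
The counts in Table III and its footnote can be checked against the totals quoted in the text. The short calculation below is offered as a consistency check. It takes the guide counts for one to five scales from the table and the counts for six and seven scales from the footnote.

```python
# Number of scales per guide -> number of follow-up guides, as given in
# Table III and its footnote.
guides_by_scale_count = {1: 12, 2: 85, 3: 262, 4: 180, 5: 88, 6: 4, 7: 3}

total_guides = sum(guides_by_scale_count.values())
total_scales = sum(k * n for k, n in guides_by_scale_count.items())
mean_scales = total_scales / total_guides

print(total_guides)            # 634
print(total_scales)            # 2173
print(round(mean_scales, 2))   # 3.43
```

The totals agree with the 634 follow-up guides, 2173 scales and mean of 3.43 scales per guide reported in the text.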



This analysis suggests that for follow-up guides
with two to five scales, any one scale score will
tend to have a moderately high correlation with
the overall Goal Attainment score for that
follow-up guide. Although this internal consistency
indicator is applied in a modified manner, it does
suggest that there is a meaningful degree of
cohesion between scale scores and total follow-up
guide scores.



D. Inter-rater Reliability as Measured by Rating
of the Videotape Interviews

1. Six-Rater Study 

Three graduate students in social work observed
the videotape of an intake interview
with a 22 year old female client as part of a 
Program Evaluation Project training program. 
After this observation, they each independently 
constructed a Goal Attainment Follow-up Guide 
for the client. These were the first follow-up 
guides constructed by them. 

At a later training session, these three
graduate students plus three others watched the
videotape of a follow-up with that same client.
Each of these six persons scored each follow-up 
guide. 

Two of these student-constructed follow-up 
guides contained four scales and the third fol- 
low-up guide contained three scales, for a total 
of eleven scales. Thus, there were eleven scales 
scored six times each, except for two instances 
in which a scale was not scored by the raters 
for unknown reasons, so that there were 64 
scale scores. Clearly, this is not a typical 
application of Goal Attainment Scaling. It 
suggests, however, the reliability possible 
with very inexperienced Goal Attainment Follow- 
up Guide constructors and follow-up raters. 

Content agreement on the three follow-up 
guides is high, as Figure II illustrates. 
"Self-concept" appears on all three follow-up guides. 
Vocational problems appear in two follow-up 
guides, as do both marriage and sexuality 
problems. 

FIGURE II: Scale Headings for Three Different 
Follow-up Guides for the Same Client 
As Table IV reveals, the fifteen inter-rater 
percentages of complete agreement on the 
individual scale scores range from 36 percent to 
73 percent, with a mean percentage of complete 
agreement of 56 percent. The percentages of 
scale scores on which the two raters' scores 
are within one goal attainment level of each 
other range from 60 percent to 100 percent, with 
a mean percentage of 83 percent. 

TABLE IV: Inter-Rater Percentage of Agreement for 
Six Raters, Agreement Expressed in 
Percentages of Scales* 

Rater   A               B               C               D               E

B       [?]0% (90%)     X
        N=10

C       60% (90%)       [illegible]     X
        N=10

D       50% (90%)       45% (91%)       72% (91%)       X
        N=10

E       44% ([?])       60% (80%)       50% (60%)       70% (70%)       X
                        N=10            N=10            N=10

F       60% (90%)       36% (91%)       64% (91%)       73% (100%)      50% (60%)
        N=10                                                            N=10


* For each pair of raters there are two percentages: 
the first percentage shows the percentage of scales 
on which there was complete agreement, and the second 
percentage, which is in parentheses, shows the 
percentage of scales on which the two raters' scores 
were no more than one level apart. 
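
The two kinds of pairwise agreement reported in Table IV can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used by the project: the function name pairwise_agreement and the sample ratings are hypothetical, and a scale that either rater left unscored is simply dropped from that pair's count, which is an assumption about how the reduced N for some pairs arose.

```python
from itertools import combinations

def pairwise_agreement(ratings):
    """For every pair of raters, return (percent exact agreement,
    percent within one level, number of scales compared).

    ratings maps a rater label to a list of scale scores (-2 to +2),
    with None wherever that rater did not score the scale.
    """
    results = {}
    for a, b in combinations(sorted(ratings), 2):
        # Keep only the scales that both raters actually scored.
        pairs = [(x, y) for x, y in zip(ratings[a], ratings[b])
                 if x is not None and y is not None]
        n = len(pairs)
        if n == 0:
            continue
        exact = sum(1 for x, y in pairs if x == y)
        within_one = sum(1 for x, y in pairs if abs(x - y) <= 1)
        results[(a, b)] = (100.0 * exact / n, 100.0 * within_one / n, n)
    return results

# Hypothetical ratings of four scales by three raters (illustration only).
ratings = {
    "A": [0, 1, -1, 2],
    "B": [0, 0, -1, None],
    "C": [1, 1, -2, 2],
}

table = pairwise_agreement(ratings)
for pair, (exact_pct, near_pct, n) in table.items():
    print(pair, round(exact_pct), round(near_pct), n)
```

With six raters the same function yields the fifteen pairs shown in Table IV.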

2. Four-Rater Study 

A set of six follow-up guides was constructed 
by six Hennepin County Mental Health Service 
clinicians, shortly after the videotape for the 
client mentioned previously was recorded. This 
study reports on a session on January 19, 1973 
in which one experienced Goal Attainment Scaling 
rater (from the Program Evaluation Project staff) 
and three inexperienced Goal Attainment Scaling 
raters scored one set of the 29 scales included 
in these six follow-up guides, after observing 
the videotape of the follow-up interview with 
client P.R. 

As Table V shows, the mean follow-up guide 
score per rater varies considerably. This 
variation suggests that there is a meaningful 
difference or bias among the raters in their 
overall level of scoring the follow-up guides, 
despite the high percentage of agreement on any 
one scale. 
TABLE V: Sum of Scale Scores** for Six Follow-up Guides Based on the Videotape as Scored by Four Raters 

            Follow-up   Follow-up   Follow-up   Follow-up   Follow-up   Follow-up   Mean Follow-up
            Guide 1     Guide 2     Guide 3     Guide 4     Guide 5     Guide 6     Score per Rater
            (4 Scales)  (6 Scales)  (5 Scales)  (5 Scales)  (5 Scales)  (4 Scales)

Rater A      3           0           1           1           3           3.5          1.92
Rater B     -2          -7          -1.5         0          -3           1*          -2.08
Rater C     -3          -5.5        - .5         1           0           3.5         - .75
Rater D     -1           2          -1          -1           1           2             .33

Mean
Score per   -.75        -2.63       - .50         .25         .25        2.5
Follow-up
Guide

*  Only one of the four scales was scored, for an unknown reason. 

** Based on a possible range of -2 to +2 for each scale. 
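
The rater bias visible in Table V can be checked with a few lines of arithmetic. The sketch below, in Python, recomputes the mean per rater and the mean per follow-up guide. The variable names are hypothetical, and the cell values are an assumption: they are the sums as best they can be read from Table V, with hard-to-read cells filled in so that they agree with the printed row and column means.

```python
# Sum of scale scores per follow-up guide (Guides 1 to 6) for each rater,
# as read from Table V. Cells that are hard to read in the available copy
# are assumptions chosen to agree with the printed means.
sums = {
    "A": [3.0, 0.0, 1.0, 1.0, 3.0, 3.5],
    "B": [-2.0, -7.0, -1.5, 0.0, -3.0, 1.0],
    "C": [-3.0, -5.5, -0.5, 1.0, 0.0, 3.5],
    "D": [-1.0, 2.0, -1.0, -1.0, 1.0, 2.0],
}

# Mean follow-up guide score per rater (the right-hand column of Table V).
rater_means = {rater: sum(row) / len(row) for rater, row in sums.items()}

# Mean score per follow-up guide across the four raters (the bottom row).
guide_means = [sum(col) / len(col) for col in zip(*sums.values())]

# Distance between the highest-scoring and lowest-scoring rater.
spread = max(rater_means.values()) - min(rater_means.values())

for rater, mean in rater_means.items():
    print(rater, round(mean, 2))
print([round(m, 2) for m in guide_means], round(spread, 2))
```

On these values the most lenient and the most severe rater differ by about four points in their mean sum per follow-up guide, which is the overall difference in scoring level that the text describes as bias.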



Table VI incorporates the data from Table V 
in the form of percentages of agreement, where 
the percentage is based on the number of raters 
agreeing on the most common response. 

TABLE VI: Percentages of Agreement for Twenty-nine 
Scales, Scored by Four Raters 

Follow-up      Scale      Most             Per Cent
Guide          Number     Common Score     Agreeing

Follow-up       1         -1                50%
Guide 1         2         -1               100%
                3          0                75%
                4          0                75%

Follow-up       5         --                 0%
Guide 2         6         -1                50%
                7         -2                75%
                8          0                75%
                9         -1               100%
               10         --                 0%

Follow-up      11         -1                50%
Guide 3        12         --                 0%
               13          0                75%
               14          0                75%
               15          0                75%

Follow-up      16          0                50%
Guide 4        17         -1               100%
               18         -2               100%
               19          1                75%
               20          2               100%

Follow-up      21          1 and 0          50%
Guide 5        22          2                50%
               23          0                50%
               24         -1 and -2         50%
               25         -.5               50%

Follow-up      26          2                50%
Guide 6        27          0                75%
               28          1                75%
               29          1               100%

The mean of these twenty-nine percentages 
of agreement is 66.7. There was complete 
agreement on six scales, 75 percent agreement on 
ten scales, 50 percent agreement on ten scales 
and 0 agreement (i.e., all four ratings were 
different) on three scales. 

However, even with this moderately high per- 
centage of agreement, the sum of all the scale 
scores varied greatly among the four raters. 
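
The per-scale percentage used in Table VI, the share of raters giving the most common score, can be sketched as below. This is a minimal Python illustration; the function name modal_agreement and the sample ratings are hypothetical. Following the convention described in the text, a scale on which every rating is different is reported as 0 percent agreement, and ties for the most common score are reported together.

```python
from collections import Counter

def modal_agreement(scores):
    """Return (most common score or scores, percent of raters agreeing).

    When every rating is different, no score is "most common" and the
    agreement is reported as 0 percent, as in the text.
    """
    counts = Counter(scores)
    top = max(counts.values())
    if top == 1 and len(scores) > 1:
        return [], 0.0
    modes = sorted(score for score, count in counts.items() if count == top)
    return modes, 100.0 * top / len(scores)

# Hypothetical ratings of four scales by four raters (illustration only).
print(modal_agreement([-1, -1, -1, -1]))  # all four raters agree
print(modal_agreement([0, 0, 0, 1]))      # three of four agree
print(modal_agreement([1, 1, 0, 0]))      # two scores tie
print(modal_agreement([-2, -1, 0, 1]))    # all four ratings differ
```

Note that this index looks only at the single most common score, so it can stay moderately high while the raters' summed scores still diverge, as they do in Table V.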



E. Inter-rater Reliability in Two Goal 
Attainment Follow-up Guides Constructed to 
Evaluate Organizational Goal Achievement 

The idea of applying Goal Attainment Scaling 
to organizations is an addition to the original 
concept of evaluating individual clients. How- 
ever, the concepts of organizational goal-set- 
ting and individual goal-setting are basically 
similar. The inter-rater reliability of two 
such organizational applications is presented 
below. 

1. A Twenty-three Scale Goal Attainment 
Follow-Up Guide 

A Goal Attainment Follow-up Guide was con- 
structed for the Program Evaluation Project in 
1971 by the supervisory staff. There were 23 
scales utilized in follow-up scoring. 

This twenty-three scale Goal Attainment 
Follow-up Guide was scored by eleven persons from 
all levels of the Program Evaluation Project staff. 
Not every scale was scored by all eleven because 
raters were given the option of not registering a 
rating if they considered the scale to be 
unscoreable. 
The mean percentage of agreement for these 
scales is 66.9%, which is very similar to the 
earlier mean percentage of agreement of 66.7% 
calculated for 29 scales in the four-rater study 
reported in section D-2. 

2. A Sixty Scale Goal Attainment Follow- 
up Guide 

A set of 60 scales was constructed by the 
Program Evaluation Project supervisory staff 
for a 1972 evaluation of the organization's 
goal attainment. The resulting Goal Attainment 
Follow-up Guide was scored by seven Program 
Evaluation Project staff members and, as in the 
earlier study, they were given the option of 
not rating a scale. For some scales, there was 
high agreement that the scales were unscoreable, 
and where four or more staff members concurred in 
rating a scale "unscoreable," the number of 
"unscoreable" ratings was used to calculate the 
percentage of agreement. 

The mean percentage of agreement for these 
60 scales was 68.1, which is similar to the 
other percentages of agreement reported above. 



F. Reliability of Scores When Multiple Follow-up 
Guides are Constructed for the Same 
Client 

The original reliability study was based on 
a design in which two Goal Attainment Follow-up 
Guides were made for each client. One follow-up 
guide was constructed by the therapist and one was 
constructed by the intake interviewer. There 
was a mean of 25 days between construction of 
the two follow-up guides. The two follow-up 
guides were developed in two different settings 
(i.e., intake versus therapy interviews). In 
the studies reported below, which are based on 
the videotape, several follow-up guides could 
be based on the single, recorded intake 
interview, thus minimizing or eliminating effects 
of different times and settings. However, since 
the follow-up guide constructors were 
inexperienced and the videotape interview is obviously 
a different situation from a live interview, 
the results are not directly comparable. 



1. The Six-Rater Study 

In this study, as presented above (see 
section D-1), three Goal Attainment Follow-up 
Guides were constructed after the raters 
observed the videotape on client P.R. The 
follow-up guide constructors were inexperienced, 
both as constructors and clinicians. 
Similarly, the six persons who scored these three 
Goal Attainment Follow-up Guides had not 
previously rated any follow-up guides of this type. 
Thus, these results reflect the reliability of 
follow-up guide construction and of follow-up 
guide scoring without the benefits of experience 
except for minimal training in Goal Attainment 
Scaling (approximately one and one half hours 
training in follow-up guide scoring). 



For each of these independently constructed 
follow-up guides, the mean score per scale was cal- 
culated and appears on Table VII below. 

TABLE VII: Mean per Scale Ratings* for Three 
Follow-up Guides Based on the 
Videotape and Scored by Six Raters 

Rater      Guide 1      Guide 2      Guide 3

A          -1.33          0          - .33
B          - .25          .75          0
C          -1.00          .25        - .67
D          -1.00          .75        - .67
E            .50         1.33          0
F          -1.00          .75        - .67

* Based on a -2 to +2 possible range for a 
single scale. 



The correlations among the six ratings for 
each follow-up guide appear on Table VIII below. 
These are the intercorrelations (Pearson Product 
Moment) of the columns in Table VII. 

TABLE VIII: Correlations Among the Mean Per 
Scale Scores for Six Independent 
Scores on Three Follow-up Guides 

COLUMNS CORRELATED                         COEFFICIENT

Follow-up Guide 1 and Follow-up Guide 2        .760
Follow-up Guide 1 and Follow-up Guide 3        .765
Follow-up Guide 2 and Follow-up Guide 3       -.234

Mean of the Absolute Value of the
Three Correlations                             .586
The mean absolute value of the correlation 
coefficient is .586, which suggests the 
stability of the mean scale score per follow-up guide 
construction even for the very inexperienced 
constructors in this example involving a 
videotape case study of one client. The range of 
correlations among the pairs of raters, from 
-.234 to .765, suggests that individual 
differences in agreement among the follow-up guide 
constructors are quite high. The .760 
correlation for follow-up guide constructor 1 versus 
follow-up guide constructor 2, and the .765 
correlation for follow-up guide constructor 1 versus 
follow-up guide constructor 3, are comparable to 
the correlation coefficients calculated for data 
of the original reliability study. (See Table I.) 
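
The summary figure of .586 can be reproduced directly from the three coefficients in Table VIII. The short Python sketch below (variable names are hypothetical) also computes the ordinary signed mean for comparison. The signed mean is not reported in the source and is shown only to make clear what taking absolute values does to the summary.

```python
# The three coefficients reported in Table VIII.
correlations = [0.760, 0.765, -0.234]

# Mean of the absolute values, as reported in the table.
mean_abs = sum(abs(r) for r in correlations) / len(correlations)

# Ordinary signed mean, for comparison only (not reported in the source).
mean_signed = sum(correlations) / len(correlations)

print(round(mean_abs, 3), round(mean_signed, 3))
```

Because one of the three coefficients is negative, the mean of absolute values is higher than the signed mean, so the .586 figure is best read as an index of the typical size of the correlations rather than of their direction.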



G. Reliability of Different Follow-up Raters 



1. Client Scoring versus Follow-up 
Interviewer Scoring for Client-Constructed 
Follow-up Guides (Guide to Goals Study) 

The Guide to Goals, Format One, can be used 
to enable clients to construct their own follow- 
up guides. In the Day Treatment Center of the 
Hennepin County Mental Health Service, clients 
were asked to use the Guide to Goals. 

These client-produced follow-up guides were 
scored independently by 1) the client and 2) 
the follow-up interviewer at an interview 
scheduled four months after the date on which 
the follow-up guide was constructed. The 
interviewers were from the regular Program Evaluation 
Project follow-up staff. Goal Attainment scores, 
one from the client and the other from the 
interviewer, were correlated. The Guide to Goals 
study was divided into two phases, and 
correlations from the cases first followed up were 
published in the Program Evaluation Project 
Newsletter in Volume IV, issues 4 and 6 (Jones 
and Garwick). 

In the report for phase one, scores from 
ten cases were available. The clients' ratings 
and the interviewers' ratings were correlated 
(Pearson Product Moment) with a coefficient of 
.71, which is statistically significant at the 
p < .01 level, two-tailed. 

In phase two, only half of the clients were 
invited to construct their own follow-up guides. 
Interviewer and client-rated Goal Attainment 
scores for seven client-constructed follow-up 
guides were correlated at .733, which is 
statistically significant at the p<.03 level, 
two-tailed. The means were 71.6 for the 
client-rated Goal Attainment score and 69.9 for the 
interviewer-rated Goal Attainment score. 

In these two studies, both ratings took 
place at the same session. The inter-rater 
reliability is comparable to the reliability 
coefficient obtained from other studies, even 
though the follow-up guides were prepared by 
persons so incapacitated that they sought 
assistance at the Mental Health Day Treatment Center, 
and who were constructing their first follow-up 
guide. 

2. Therapist Scoring Versus Follow-up Worker 
Scoring 

This study involved the therapists' scoring 
of follow-up guides which had been constructed 
at the Hennepin County Mental Health Service by 
intake interviewers (Baxter, 1973). After a 
follow-up interviewer had scored the follow-up 
guide and returned it to the Program Evaluation 
Project staff, the follow-up guide was given 
to the therapist for scoring, before the 
therapist saw the follow-up interviewer's scoring. 
The therapist did not interview the client when 
scoring the Goal Attainment Follow-up Guide. 
In practice, there was a sizeable delay of at 
least one or two weeks between these two ratings, 
although the actual length of time is not 
specified in the original report. Under these 
conditions, the correlations between therapist and 
follow-up interviewer ratings tend to be quite 
modest for the Adult Outpatients and quite high 
for the Day Treatment clients. (See Table IX.) 

TABLE IX: Pearson Correlation Between Goal 
Attainment Scores for Clients Based on Two 
Independent Interviews, One by Therapist 
and One by Follow-up Interviewer 

Type of Client                N           Correlation

Adult Outpatient Clients      N = 53         .507
Day Treatment Clients         N = 8          .848




H. Reliability Over Time (The Drug Effectiveness 
Study) 

The Drug Study procedures are described in 
another chapter of the P.E.P. Report 1969-1973. 
In brief, the study involved three follow-up 
interviews per client. One interview was held 
three weeks after construction of the follow-up 
guide, the second was two months after 
construction, and the third was six months after 
construction. The follow-up guides were focused 
most directly on the two month follow-up 
interview. Follow-up interviews were conducted by 
experienced masters degree social workers on the 
Program Evaluation Project follow-up staff. 

As the data in Table X below 
suggest, there appears to be a trend for the mean 
Goal Attainment scores to rise as the time since 
follow-up guide construction increases. However, 
the rise in means would not necessarily be related 
to changes in reliability, if all cases increased 
proportionately. Table XI shows the correlations 
of scores for the follow-up dates. 
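
The point that a rise in means need not affect reliability can be shown with a small sketch in Python. Everything in it is hypothetical: the function name pearson_r and the scores are invented for illustration and are not data from the Drug Study. When every client's score rises by the same amount, the mean changes but the correlation between the two follow-ups stays at 1. When clients change places relative to one another, the correlation drops even if the mean rises by the same amount.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical Goal Attainment scores for five clients (illustration only).
first_followup = [42.0, 48.0, 51.0, 55.0, 60.0]

# Case 1: every client gains eight points; relative standing is unchanged.
uniform_rise = [score + 8.0 for score in first_followup]

# Case 2: the same mean rise, but the clients' relative standing is shuffled.
reordered_rise = [63.0, 50.0, 68.0, 56.0, 59.0]

r_uniform = pearson_r(first_followup, uniform_rise)
r_reordered = pearson_r(first_followup, reordered_rise)

print(mean(uniform_rise) - mean(first_followup), round(r_uniform, 2))
print(mean(reordered_rise) - mean(first_followup), round(r_reordered, 2))
```

The second case corresponds to the situation the text goes on to describe, in which clients change their relative degree of goal attainment over time.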
TABLE X: Mean Goal Attainment Scores for Three 
Follow-up Sessions 

                        Mean Goal           Standard
                        Attainment Score    Deviation     N

Three Weeks After       51.17               10.8          18
Construction

Two Months After        [illegible]         11.0          17
Construction

Six Months After        61.3                10.1          16
Construction




As the correlations in Table XI below reveal, 
there is a relatively low degree of correlation 
among the Goal Attainment scores from the three 
follow-up times. The mean correlation is only 
.46, which is lower than correlations typically 
obtained from multiple follow-ups at the same 
follow-up session or from two follow-ups by 
different follow-up workers. This somewhat 
lower correlation may imply that the degree of 
attainment of expectations does change consid- 
erably within a six month period, that is, that 
knowledge of goal attainment at one time is not 
always sufficient for prediction of goal 
attainment at other times. It appears that in addition 
to the rise in mean Goal Attainment scores over 
time, there is a fairly high amount of change in 
the relative degree of goal attainment over time 
among the clients. 

TABLE XI: Correlations of Kiresuk-Sherman Goal 
Attainment Scores for Three Follow-up 
Interviews 

                               Follow-up Three Weeks     Follow-up Two Months
                               After Construction        After Construction

Follow-up Two Months After     .70 (N=17)
Follow-up Guide Construction   (p<.001, two-tailed)

Follow-up Six Months After     .20 (N=1[illegible])      .47 (N=16)
Follow-up Guide Construction   (not statistically        (p=.05, two-tailed)
                               significant at the
                               p < .03 level)




III. An Overview of Correlational Estimates of 
Goal Attainment Scaling Reliability 

As emphasized in the introductory comments, 
the complexity of reliability estimation should 
now be clearer. Goal Attainment Scaling has 
many points at which the reliability of the tech- 
nique can be altered. These points include, at 
least: 

a. who constructs the follow-up guide? 
(client, therapist or intake 
interviewer, etc.) 

b. how many scales appear on the 
follow-up guide? 

c. when is the follow-up guide scored? 
(how long after follow-up guide 
construction?) 

d. who scores the follow-up guide? 
(are they experienced in such 
scoring? what background do they have?) 

e. how many raters are there? 

f. in what circumstances are the 
follow-up guides scored? (in-person, by 
mail, by telephone) 

Since Goal Attainment Scaling is a basic 
approach which encompasses an array of possible 
technical variants, a whole galaxy of possible 
reliability estimates could be helpful for dif- 
ferent applications. 

The reliability estimates discussed in this 
chapter were produced in a variety of situations 
but most suggest a usefully high degree of inter- 
rater agreement, and a usefully high degree of 
Goal Attainment score stability for clients when 
follow-up guides are made by more than one person. 